
GH-44042: [C++][Parquet] Limit num-of row-groups when building parquet for encrypted file #44043

Merged
merged 8 commits into from
Jan 24, 2025

Conversation

mapleFU
Member

@mapleFU mapleFU commented Sep 10, 2024

Rationale for this change

Limit the number of row groups when building a Parquet file.

What changes are included in this PR?

Limit the number of row groups when building a Parquet file.

Are these changes tested?

No

Are there any user-facing changes?

No

@mapleFU mapleFU requested a review from wgtmac as a code owner September 10, 2024 11:59
@mapleFU
Member Author

mapleFU commented Sep 10, 2024

@wgtmac @pitrou would you mind taking a look?

For testing, I don't know whether it would be too slow; if not, I'd add one.


@pitrou
Member

pitrou commented Sep 10, 2024

Is erroring out the right solution? The row group ordinal is optional and was introduced for encryption. Furthermore, it's not obvious that it should be unique. https://github.com/apache/parquet-format/blob/master/Encryption.md gives no guidelines as to ordinal generation.

@ggershinsky What is the intent here? Should ordinal be unique? But then, should the Parquet reader check for uniqueness? Otherwise a malicious file could still be crafted.

@pitrou
Member

pitrou commented Sep 10, 2024

Ok, the C++ reader always uses the physical row group number as the row group ordinal for AAD computation:

constexpr auto kEncryptedRowGroupsLimit = 32767;
if (i > kEncryptedRowGroupsLimit) {
throw ParquetException("Encrypted files cannot contain more than 32767 row groups");
}
CryptoContext ctx(col->has_dictionary_page(), row_group_ordinal_,
static_cast<int16_t>(i), meta_decryptor, data_decryptor);
return PageReader::Open(stream, col->num_values(), col->compression(), properties_,
always_compressed, &ctx);

This begs the question: why is the row group ordinal stored in Thrift metadata if it's simply ignored when reading?
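In other words, the reader derives the AAD ordinal from the physical iteration position and ignores whatever ordinal the metadata carries. A tiny illustration of that distinction (hypothetical struct and function names, not the actual Arrow C++ API):

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the Thrift row-group struct; the real one
// carries an optional 'ordinal' field introduced for encryption.
struct RowGroupMeta {
  int16_t ordinal;
};

// What the C++ reader effectively does: the AAD ordinal is the physical
// position i of the row group; the stored metadata ordinal is not consulted.
int16_t AadOrdinal(const std::vector<RowGroupMeta>& row_groups, int i) {
  (void)row_groups[i].ordinal;  // present in metadata but ignored
  return static_cast<int16_t>(i);
}
```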

@pitrou
Member

pitrou commented Sep 10, 2024

And by the way, the same audit should be done for column and page ordinals.

@wgtmac
Member

wgtmac commented Sep 10, 2024

Ok, the C++ reader always uses the physical row group number as the row group ordinal for AAD computation:

Isn't it a bug on the C++ side? It may produce a wrong AAD when reading a Parquet file created by ParquetRewriter, which binpacks several encrypted files.

EDIT: My bad. ParquetRewriter does not support binpacking encrypted files yet.

@pitrou
Member

pitrou commented Sep 10, 2024

It may produce a wrong AAD when reading a Parquet file created by ParquetRewriter, which binpacks several encrypted files.

Hopefully it's not possible to concatenate encrypted files together? Otherwise one can create malicious data (replay attack).

@ggershinsky
Contributor

The number of row groups (columns, pages) should be limited for encrypted files only. I don't think this exception should be thrown for unencrypted files.

@ggershinsky
Contributor

ggershinsky commented Sep 10, 2024

It may produce a wrong AAD when reading a Parquet file created by ParquetRewriter, which binpacks several encrypted files.

Hopefully it's not possible to concatenate encrypted files together? Otherwise one can create malicious data (replay attack).

Yep, merging encrypted files is impossible, the AAD has a unique file id component.

@pitrou
Member

pitrou commented Sep 10, 2024

Yep, merging encrypted files is impossible, the AAD has a unique file id component.

It seems to be optional and disabled by default according to https://github.com/apache/parquet-format/blob/master/Encryption.md#441-aad-prefix ?

Regardless, my question was about ordinal reuse or shuffling. Should readers verify that ordinals correspond to the physical row group numbers? What is the intent exactly? The spec does not say how these should be handled.

@ggershinsky
Contributor

Yep, merging encrypted files is impossible, the AAD has a unique file id component.

It seems to be optional and disabled by default according to https://github.com/apache/parquet-format/blob/master/Encryption.md#441-aad-prefix ?

It's a different parameter, aad_file_unique, a mandatory part of the AAD suffix:
https://github.com/apache/parquet-format/blob/master/Encryption.md#442-aad-suffix
https://github.com/apache/parquet-format/blob/master/Encryption.md#52-crypto-structures

Regardless, my question was about ordinal reuse or shuffling. Should readers verify that ordinals correspond to the physical row group numbers? What is the intent exactly? The spec does not say how these should be handled.

The rg ordinals in thrift are a utility, which can be helpful for readers that split a single file into parallel reading threads. The readers can also run an rg loop/counter. The values will be the same, unless the file is tampered with. Note - the thrift footer is tamper-proof, so the rg ordinals are safe; but the row groups / pages can be shuffled by an attacker. Reading a page from a shuffled rg will throw an exception.
The situation is similar for column and page ordinals. The spec says "The suffix part of a module AAD protects against module swapping inside a file.", but doesn't go into smaller details because they can be derived directly from the provided information.
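For reference, the spec links above describe the module AAD as the file AAD (aad_prefix + aad_file_unique) followed by a one-byte module type and little-endian int16 ordinals. A minimal sketch of that byte layout, covering only the data-page case and using hypothetical helper names (not the actual Arrow C++ API):

```cpp
#include <cstdint>
#include <string>

// Encode an int16 ordinal as two little-endian bytes, as the AAD suffix
// in the Parquet encryption spec requires.
std::string ShortToBytesLE(int16_t v) {
  return std::string{static_cast<char>(v & 0xFF),
                     static_cast<char>((v >> 8) & 0xFF)};
}

// Sketch of a data-page module AAD: file AAD, module-type byte, then
// row-group, column, and page ordinals. Other module types use fewer
// ordinals; see the spec for the full enumeration.
std::string ModuleAad(const std::string& file_aad, char module_type,
                      int16_t row_group_ordinal, int16_t column_ordinal,
                      int16_t page_ordinal) {
  return file_aad + module_type + ShortToBytesLE(row_group_ordinal) +
         ShortToBytesLE(column_ordinal) + ShortToBytesLE(page_ordinal);
}
```

Because the ordinals are baked into the AAD, decrypting a page that was moved to a different row-group or column position fails authentication, which is the "module swapping" protection the spec refers to.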

@pitrou
Member

pitrou commented Sep 10, 2024

The spec says "The suffix part of a module AAD protects against module swapping inside a file.", but doesn't go into smaller details because they can be derived directly from the provided information.

Since we're talking about a security feature, it would be much better if the spec gave clear guidelines instead of letting implementers do the necessary guesswork (and potentially make errors).

@ggershinsky
Contributor

I do agree the spec often feels laconic, not only here but in other places too. Expanding it, with ample explanation and reasoning, would result in a different document (implementation guide), much longer than the spec.
However, the spec details are sufficient for a secure implementation. In this example, a reader can use either a row group counter (the term "ordinal" has a well-defined meaning) or the thrift value; either one will support the intended security model. Still, since the other ordinals (column and page) don't have a thrift duplicate, I can add clarification text that makes this point explicit. Will send a patch.

@mapleFU
Member Author

mapleFU commented Sep 23, 2024

@ggershinsky Would you mind adding a patch to parquet-format? Then I can move forward here.

@ggershinsky
Contributor

Sure, here it goes apache/parquet-format#453

@mapleFU
Member Author

mapleFU commented Sep 29, 2024

apache/parquet-format#453 is merged, and I understand this feature is for encryption. However, this value is still written to the metadata. Should I:

  1. Just throw, like the current implementation, when the row-group ordinal is greater than i16::max
  2. Not write the ordinal when not encrypting, so this check can be loosened

@pitrou
Member

pitrou commented Sep 29, 2024

Perhaps a quick check that other Parquet readers only use the row group ordinal for encrypted files?

@curioustien

Perhaps a quick check that other Parquet readers only use the row group ordinal for encrypted files?

@mapleFU @pitrou I'm trying to pick up this task to ramp up on the parquet codebase. Upon checking the parquet java implementation of the ordinal field in RowGroup, I think there is a small bug/discrepancy between the current cpp implementation and the java implementation.

In the java implementation, there is a check for whether the file is encrypted, done by checking the fileEncryptor pointer. If it's not an encrypted file, the rowGroupOrdinal value is -1, which I believe means that Thrift won't set that field. Otherwise, we use the rowGroupOrdinal value, which is incremented every time we flush a row group. There is, however, a small bug in the java implementation in that it uses int instead of short for the int16 ordinal variable, and there are no checks on integer overflow. In practice, people probably don't hit this limit on the number of row groups anyway, so the bug hasn't shown up yet.

In the cpp implementation, we don't have a check for an encrypted file like the java implementation does when we write the row group ordinal (not that I've found, or maybe I'm missing something), so we always write the ordinal value, even for unencrypted files. In my opinion, we should change the logic here to behave like the java implementation, with an additional check on integer overflow.

What do you both think?
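The behavior proposed above (ordinal only for encrypted files, plus an overflow check) can be sketched with a hypothetical helper, not the actual Arrow writer code:

```cpp
#include <cstdint>
#include <limits>
#include <optional>
#include <stdexcept>

// Hypothetical helper sketching the proposal: return the ordinal to store
// for the next row group, or std::nullopt to leave the optional Thrift
// field unset for unencrypted files; fail fast on int16 overflow.
std::optional<int16_t> NextRowGroupOrdinal(int num_row_groups, bool encrypted) {
  if (!encrypted) return std::nullopt;  // field stays unset, no limit applies
  if (num_row_groups > std::numeric_limits<int16_t>::max()) {
    throw std::runtime_error(
        "Encrypted files cannot contain more than 32767 row groups");
  }
  return static_cast<int16_t>(num_row_groups);
}
```

Leaving the optional Thrift field unset for unencrypted files would mirror the java writer's -1 sentinel behavior described above.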

@mapleFU
Member Author

mapleFU commented Jan 13, 2025

The idea LGTM

@@ -267,12 +267,13 @@ class SerializedRowGroup : public RowGroupReader::Contents {
   ARROW_DCHECK_NE(data_decryptor, nullptr);

   constexpr auto kEncryptedRowGroupsLimit = 32767;
-  if (i > kEncryptedRowGroupsLimit) {
+  if (ARROW_PREDICT_FALSE(row_group_ordinal_ > kEncryptedRowGroupsLimit)) {
Member Author


The previous check actually checks the column_id, not the row-group ordinal.

@@ -359,14 +359,26 @@ class FileSerializer : public ParquetFileWriter::Contents {
   if (row_group_writer_) {
     row_group_writer_->Close();
   }
+  int16_t row_group_ordinal = 0;
+  if (num_row_groups_ < std::numeric_limits<int16_t>::max()) {
+    row_group_ordinal = static_cast<int16_t>(num_row_groups_);
Member Author


This check is a bit ugly. Maybe we could also not write the row_group_ordinal for unencrypted files.

Member


Yes, I think there is no need to write it.

@github-actions github-actions bot added the awaiting committer review label and removed the awaiting review label Jan 13, 2025
@mapleFU
Member Author

mapleFU commented Jan 13, 2025

@pitrou @wgtmac Sorry for the delay; would you mind taking a look?


   if (ccmd.__isset.ENCRYPTION_WITH_COLUMN_KEY) {
     if (file_decryptor != nullptr && file_decryptor->properties() != nullptr) {
       // should decrypt metadata
       std::shared_ptr<schema::ColumnPath> path = std::make_shared<schema::ColumnPath>(
           ccmd.ENCRYPTION_WITH_COLUMN_KEY.path_in_schema);
-      std::string key_metadata = ccmd.ENCRYPTION_WITH_COLUMN_KEY.key_metadata;
+      const std::string& key_metadata = ccmd.ENCRYPTION_WITH_COLUMN_KEY.key_metadata;
Member Author


I also wonder: should we check row_group_ordinal >= 0 here? 🤔

Member


I don't think it's necessary, any invalid row group ordinal will fail the AAD check when reading.



-  constexpr auto kEncryptedRowGroupsLimit = 32767;
-  if (i > kEncryptedRowGroupsLimit) {
+  constexpr auto kEncryptedOrdinalLimit = 32767;
+  if (ARROW_PREDICT_FALSE(row_group_ordinal_ > kEncryptedOrdinalLimit)) {
Member


I am wondering if this is the best place to enforce this. Is it possible to check these while creating the FileMetaData if encrypted?

Member


It seems that if we don't request reading data from row_group_ordinal > 32767 or column_ordinal > 32767, we will not even notice that it is a malformed encrypted file.

Member Author


Do you mean when constructing ColumnChunkMetaData::ColumnChunkMetaDataImpl? What if ENCRYPTION_WITH_COLUMN_KEY is not set but the column is encrypted?

@mapleFU mapleFU changed the title GH-44042: [C++][Parquet] Limit num-of row-groups when build parquet GH-44042: [C++][Parquet] Limit num-of row-groups when build parquet for encrpty file Jan 14, 2025
@mapleFU mapleFU requested review from pitrou and wgtmac January 20, 2025 08:52
@wgtmac wgtmac changed the title GH-44042: [C++][Parquet] Limit num-of row-groups when build parquet for encrpty file GH-44042: [C++][Parquet] Limit num-of row-groups when building parquet for encrypted file Jan 24, 2025
@mapleFU mapleFU merged commit f4a63d4 into apache:main Jan 24, 2025
37 checks passed
@mapleFU mapleFU removed the awaiting committer review label Jan 24, 2025

After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit f4a63d4.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 339 possible false positives for unstable benchmarks that are known to sometimes produce them.
